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Summary. 

In  this  paper,  we  propose  the  Boosted  Lasso  (BLasso)  algorithm  that  is  able  to  produce  an  ap¬ 
proximation  to  the  complete  regularization  path  for  general  Lasso  problems.  BLasso  is  derived 
as  a  coordinate  descent  method  with  a  fixed  small  step  size  applied  to  the  general  Lasso  loss 
function  (Li  penalized  convex  loss).  It  consists  of  both  a  forward  step  and  a  backward  step 
and  uses  differences  of  functions  instead  of  gradient.  The  forward  step  is  similar  to  Boosting 
and  Forward  Stagewlse  Fitting,  but  the  backward  step  Is  new  and  crucial  for  BLasso  to  approx¬ 
imate  the  Lasso  path  in  all  situations.  For  cases  with  finite  number  of  base  learners,  when  the 
step  size  goes  to  zero,  the  BLasso  path  Is  shown  to  converge  to  the  Lasso  path.  For  nonpara- 
metric  learning  preblems  with  a  large  er  an  infinite  number  ef  base  learners,  BLasso  remains 
valid  since  its  forward  steps  are  Boosting  steps  and  its  backward  steps  only  involve  the  base 
learners  that  are  Included  In  the  model  from  previous  iterations.  Experimental  results  are  also 
provided  to  demonstrate  the  difference  between  BLasso  and  Boosting  or  Forward  Stagewise 
Fitting.  In  addition,  we  extend  BLasso  to  the  case  of  a  general  convex  loss  penalized  by  a 
general  convex  function  and  illustrate  this  extended  BLasse  with  examples. 

Keywords:  Coordinate  Descent;  Backward;  Beosting;  Lasso;  Regularization  Path. 

1.  Introduction 

Lasso  (Tibshirani,  1996)  is  an  important  idea  recently  originated  in  statistics.  It  regularizes 
or  shrinks  a  fitted  model  through  an  Li  penalty  or  constraint.  Its  popularity  can  be  ex¬ 
plained  in  several  ways.  Since  nonparametric  models  that  fit  training  data  well  often  have 
low  biases  but  large  variances,  prediction  accuracies  can  sometimes  be  improved  by  shrink¬ 
ing  a  model  or  making  it  more  sparse.  The  regularization  resulting  from  the  Li  penalty 
leads  to  sparse  solutions,  that  is,  there  are  few  basis  functions  with  nonzero  weights  (among 
all  possible  choices).  This  statement  is  proved  asymptotically  by  Knight  and  Fu  (2000)  and 
for  the  finite  case  by  Donoho  (2004)  in  the  specialized  setting  of  over-complete  libraries 
and  large  under-determined  systems  of  linear  equations.  Furthermore,  the  sparse  models 
induced  by  Lasso  are  more  interpretable  and  often  preferred  in  sciences  and  social  sciences. 

Another  vastly  popular  and  recent  idea  is  Boosting.  Since  its  inception  in  1990  (Schapire, 
1990;  Freund,  1995;  Freund  and  Schapire,  1996),  it  has  become  one  of  the  most  successful 
machine  learning  methods  and  is  now  understood  as  an  iterative  method  leading  to  an 
additive  model  (Breiman,  1998,  Mason  et  ah,  1999  and  Friedman  et  ah,  2001). 

While  it  is  a  natural  idea  to  combine  boosting  and  Lasso  to  have  a  regularized  Boosting 
procedure,  it  is  also  intriguing  that  Boosting,  without  any  additional  regularization,  has 
its  own  resistance  to  overfitting.  For  specific  cases  such  as  L2Boost  (Friedman,  2001), 
this  resistance  is  understood  to  some  extent  (Buhlmann  and  Yu,  2001).  However,  it  was 
not  until  later  when  Forward  Stagewise  Fitting  (FSF)  was  introduced  as  a  boosting  based 
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procedure  with  much  more  cautious  steps  that  a  similarity  between  FSF  and  Lasso  was 
observed  (Hastie  et  ah,  2001,  Efron  et  ah,  2004). 

This  link  between  Lasso  and  FSF  is  formally  described  for  the  linear  regression  case 
through  LARS  (Least  Angle  Regression,  Efron  et  al.  2004).  It  is  also  known  that  for  spe¬ 
cial  cases  (such  as  orthogonal  designs)  FSF  can  approximate  Lasso  path  infinitely  close,  but 
in  general,  it  is  unclear  what  regularization  criteria  FSF  optimizes.  As  can  be  seen  in  our 
experiments  (Figure  1),  FSF  solutions  can  be  significantly  different  from  the  Lasso  solutions 
in  the  case  of  strongly  correlated  predictors  which  are  common  in  high-dimensional  data 
problems.  However,  it  is  still  used  as  an  approximation  to  Lasso  because  it  is  often  com¬ 
putationally  prohibitive  to  solve  Lasso  with  general  loss  functions  for  many  regularization 
parameters  through  Quadratic  Programming. 

In  this  paper,  we  propose  a  new  algorithm  Boosted  Lasso  (BLasso).  It  approximates 
the  Lasso  path  in  all  situations.  The  motivation  comes  from  a  critical  observation  that  both 
FSF  and  Boosting  only  work  in  a  forward  fashion  (so  is  FSF  named).  They  always  take 
steps  that  reduce  empirical  loss  the  most  regardless  of  the  impact  on  model  complexity  (or 
the  Li  penalty  in  the  Lasso  case).  This  often  proves  to  be  too  greedy  -  the  algorithms  are 
not  able  to  correct  mistakes  made  in  early  stages.  Taking  a  coordinate  descent  view  point  of 
the  Lasso  minimization  with  a  fixed  step  size,  we  introduce  an  innovative  “backward”  step. 
This  step  utilizes  the  same  minimization  rule  as  the  forward  step  to  define  each  fitting  stage 
with  an  additional  rule  to  force  the  model  complexity  to  decrease.  By  combining  backward 
and  forward  steps.  Boosted  Lasso  is  able  to  go  back  and  forth  and  approximates  the  Lasso 
path  correctly. 

BLasso  can  be  seen  as  a  marriage  between  two  families  of  successful  methods.  Com¬ 
putationally,  BLasso  works  similarly  to  Boosting  and  FSF.  It  isolates  the  sub-optimization 
problem  at  each  step  from  the  whole  process,  i.e.  in  the  language  of  the  Boosting  litera¬ 
ture,  each  base  learner  is  learned  separately.  This  way  BLasso  can  deal  with  different  loss 
functions  and  large  classes  of  base  learners  like  trees,  wavelets  and  splines  by  fitting  a  base 
learner  at  each  step  and  aggregating  the  base  learners  as  the  algorithm  progresses.  But  un¬ 
like  FSF  which  has  implicit  local  regularization  in  the  sense  that  at  any  iteration,  FSF  with 
a  fixed  small  step  size  only  searches  over  those  models  which  are  one  small  step  away  from 
the  current  one  in  all  possible  directions  corresponding  to  the  base  learners,  BLasso  can  be 
proven  to  converge  to  the  Lasso  solutions  which  have  explicit  global  Li  regularization  for 
cases  with  a  finite  number  of  base  learners.  And  the  algorithm  uses  only  the  differences  of 
the  loss  function  and  basic  operations  without  taking  any  first  or  second  order  derivative 
or  matrix  inversion.  The  fact  that  BLasso  can  be  generalized  to  give  regularized  path  for 
other  convex  penalties  also  comes  as  a  pleasant  surprise  which  further  shows  a  connection 
between  statistical  estimation  methods  and  the  Barrier  Method  (Fiacco  and  McCormick, 
1968,  and  cf.  Boyd  and  Vandenberghe  2004)  in  the  constrained  convex  optimization  lit¬ 
erature.  For  the  original  Lasso  problem,  i.e.  the  least  squares  problem  with  Li  penalty, 
algorithms  that  give  the  entire  Lasso  path  have  been  established  (Shooting  algorithm  by  Fu 
1998  and  LARS  by  Efron  et  al.  2004).  These  algorithms  are  not  suitable  for  nonparametric 
learning  where  the  number  of  base  learners  can  be  very  large  or  infinite.  BLasso  remains 
valid  since  it  treats  each  fitting  step  as  a  sub-optimization  problem.  Even  for  cases  where 
the  number  of  base  learners  is  small,  BLasso  is  still  an  attractive  alternative  to  existing 
algorithms  because  of  its  simple  and  intuitive  nature  and  the  fact  that  it  can  be  stopped 
early  to  find  an  solution  from  the  regularization  path. 

The  rest  of  the  paper  is  organized  as  follows.  A  brief  review  of  Boosting  and  FSF  is 
provided  in  Section  2.1  and  Lasso  in  Section  2.2.  Section  3  introduces  BLasso.  Section  4 
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discusses  the  backward  step  and  gives  the  intuition  behind  BLasso  and  explains  why  FSF  is 
unable  to  give  the  Lasso  path.  Section  5  contains  a  more  detailed  discussion  on  solving  Li 
penalized  least  squares  problem  using  BLasso.  Section  6  discusses  BLasso  for  nonparametric 
learning  problems.  Section  7  introduces  a  Generalized  BLasso  algorithm  which  deals  with 
general  convex  penalties.  In  Section  8,  results  of  experiments  with  both  simulated  and 
real  data  are  reported  to  demonstrate  the  attractiveness  of  BLasso  and  differences  between 
BLasso  and  FSF.  Finally,  Section  9  contains  a  discussion  on  the  choice  of  step  sizes,  a 
summary  of  the  paper,  and  future  research  directions. 

2.  Boosting,  Forward  Stagewise  Fitting  and  the  Lasso 

Boosting  is  an  iterative  fitting  procedure  that  builds  up  a  model  stage  by  stage.  FSF  can 
be  viewed  as  Boosting  with  a  fixed  small  step  size  at  each  stage  and  it  produces  solutions 
that  are  often  close  to  the  Lasso  solutions  (path).  This  section  gives  a  brief  review  of  the 
FSF  and  Boosting  algorithms  followed  by  a  review  of  Lasso. 

2. 1.  Boosting  and  Forward  Stagewise  Fitting 

The  boosting  algorithms  can  be  seen  as  functional  gradient  descent  techniques  (Friedman 
et  al.  2000,  Mason  et  al.  1999).  The  task  is  to  estimate  the  function  F  :  R‘^  ^  R  that 
minimizes  an  expected  loss 

E[C{Y,F{X))],  :  Rx  R^  R+  (1) 

based  on  data  Zi  =  =  l,...,n).  The  univariate  Y  can  be  continuous  (regression 

problem)  or  discrete  (classification  problem).  The  most  prominent  examples  for  the  loss 
function  C(-,  •)  include  Classification  Margin,  Logit  Loss  and  L2  Loss  functions. 

The  family  of  F{-)  being  considered  is  the  set  of  ensembles  of  “base  learners” 

D={F:  F{x)  =  e  R‘^,f3j  G  R},  (2) 

i 

where  the  family  of  base  learners  can  be  very  large  or  contain  infinite  members,  e.g.  trees, 
wavelets  and  splines. 

Let  d  =  {di,  ■■■djj  ■■■Y' j  we  can  reparametrize  the  problem  using 

L{Z,d)-.=  C{Y,F{X)),  (3) 

where  the  specification  of  F  is  hidden  by  L  to  make  our  notation  simpler. 

To  find  an  estimate  for  /?,  we  set  up  an  empirical  minimization  problem: 

n 

P  =  arg  min  E  L(Z,;/3).  (4) 

^  i=l 

Despite  the  fact  that  the  empirical  loss  function  is  often  convex  in  /3,  exact  minimization 
is  usually  a  formidable  task  for  a  moderately  rich  function  family  of  base  learners  and  with 
such  function  families  the  exact  minimization  leads  to  overfitted  models.  Boosting  is  a 
progressive  procedure  that  iteratively  builds  up  the  solution  (and  it  is  often  stopped  early 
to  avoid  overfitting): 
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n 


0,9) 

=  argmin  VL(Zi;/3* -hgly) 

3,9  “ 

2=1 

(5) 

=  + 

(6) 

where  Ij  is  the  jth  standard  basis,  i.e.  the  vector  with  all  Os  except  for  a  1  in  the  jth 
coordinate  and  g  P  R\s  &  step  size  parameter.  However,  because  the  family  of  base  learners 
is  usually  large,  finding  j  is  still  difficult,  for  which  approximate  solutions  are  found.  A 
typical  strategy  that  Boosting  implements  is  by  applying  functional  gradient  descent.  This 
gradient  descent  view  has  been  recognized  and  refined  by  various  authors  including  Breiman 
(1998),  Mason  et  al.  (1999),  Friedman  et  al.  (2000),  Friedman  (2001),  and  Buhlmann  and 
Yu  (2001).  The  well-known  AdaBoost,  LogitBoost  and  L2Boosting  can  all  be  viewed  as 
implementations  of  this  strategy  for  different  loss  functions. 

Forward  Stagewise  Fitting  (FSF)  is  a  similar  method  for  approximating  the  minimization 
problem  described  by  (5)  with  some  additional  regularization.  Instead  of  optimizing  the 
step  size  as  in  (6),  FSF  updates  /3*  by  a  small  fixed  step  size  e  as  in  Friedman  (2001): 

/3*+i  =/3*  +  e.sign(g)l. 

When  FSF  was  introduced  (Hastie  et  al.  2001,  Efron  et  al.  2002),  it  was  only  described 
for  the  L2  regression  setting.  For  general  loss  functions,  it  can  be  defined  by  removing  the 
minimization  over  g  in  (5): 


n 


0,  s)  =  arg  mm  V  L(Z,;  /3*  +  slj), 

(7) 

/3‘+i  =  /3*  +  slj, 

(8) 

This  description  looks  different  from  the  FSF  described  in  Efron  et  al.(2002),  but  the 
underlying  mechanic  of  the  algorithm  remains  unchanged  (see  Section  5) .  Initially  all  coef¬ 
ficients  are  zero.  At  each  successive  step,  a  predictor  or  coordinate  is  selected  that  reduces 
most  the  empirical  loss.  Its  corresponding  coefficient  /3--  is  then  incremented  or  decremented 
by  a  small  amount,  while  all  other  coefficients  (3j,  j  yf  j  are  left  unchanged. 

By  taking  small  steps,  FSF  imposes  some  implicit  regularization.  After  T  <  00  itera¬ 
tions,  many  of  the  coefficients  will  be  zero,  namely  those  that  have  yet  to  be  incremented. 
The  others  will  tend  to  have  absolute  values  smaller  than  the  unregularized  solutions.  This 
shrinkage/sparsity  property  is  reflected  in  the  similarity  between  the  solutions  given  by  FSF 
and  Lasso  which  is  reviewed  next. 


2.2.  Lasso 

Let  T{f3)  denote  the  Li  penalty  of  /3  =  (/3i,  ...,/3y,  ■■■)^,  that  is,  T(/3)  =  ||/3||i  =  |/3j|,  and 

let  F(/3;  A)  denote  the  Lasso  (least  absolute  shrinkage  and  selection  operator)  loss  function 

n 

F(/3;  A)  =  ^L(Z,;/3)+AT(/3). 

2=1 


(9) 


Boosted  Lasso  5 


The  Lasso  estimate  $  =  (/3i,  ...)^  is  defined  by 


/3a  =  mjnr(/3;  A). 

The  parameter  A  >  0  controls  the  amount  of  regularization  applied  to  the  estimate. 
Setting  A  =  0  reverses  the  Lasso  problem  to  minimizing  the  unregularized  empirical  loss. 
On  the  other  hand,  a  very  large  A  will  completely  shrink  /3  to  0  thus  leading  to  the  empty 
or  null  model.  In  general,  moderate  values  of  A  will  cause  shrinkage  of  the  solutions  towards 
0,  and  some  coefficients  may  end  up  being  exactly  0.  This  sparsity  in  Lasso  solutions  has 
been  researched  extensively,  e.g.  Efron  et  al.  (2004)  and  Donoho  (2004).  (Sparsity  can  also 
result  from  other  penalizations  as  in  Gao  and  Bruce,  1997  and  Fan  and  Li,  2001.) 

Computation  of  the  solution  to  the  Lasso  problem  for  a  fixed  A  has  been  studied  for  spe¬ 
cial  cases.  Specifically,  for  least  squares  regression,  it  is  a  quadratic  programming  problem 
with  linear  inequality  constraints;  for  1-norm  SVM,  it  can  be  transformed  into  a  linear  pro¬ 
gramming  problem.  But  to  get  a  model  that  performs  well  on  future  data,  we  need  to  select 
an  appropriate  value  for  the  tuning  parameter  A.  Practical  algorithms  have  been  proposed 
to  give  the  entire  regularization  path  for  the  squared  loss  function  (Shooting  algorithm  by 
Fu,  1998  and  LARS  by  Efron  et  al.,  2004)  and  SVM  (1-norm  SVM  by  Zhu  et  al,  2003). 

But  how  to  give  the  entire  regularization  path  of  the  Lasso  problem  for  general  convex 
loss  function  remained  open.  More  importantly,  existing  methods  are  only  efficient  when 
the  number  of  base  learners  is  small,  that  is,  they  can  not  deal  with  large  numbers  of  base 
learners  that  often  arise  in  nonparametric  learning,  e.g.  trees,  wavelets  and  splines.  FSF 
exists  as  a  compromise  since,  like  Boosting,  it  is  a  nonparametric  learning  algorithm  that 
works  with  different  loss  functions  and  large  numbers  of  base  learners  but  it  only  has  implicit 
local  regularization  and  is  often  too  greedy  comparing  to  Lasso  as  can  be  seen  from  our 
analysis  (Section  4)  and  experiment  (Figure  1). 

Next  we  propose  the  Boosted  Lasso  (BLasso)  algorithm  which  works  in  a  computation¬ 
ally  efficient  fashion  as  FSF.  But  unlike  FSF,  BLasso  is  able  to  approximate  the  Lasso  path 
for  general  convex  loss  functions  infinitely  close. 


3.  Boosted  Lasso 

We  start  with  the  description  of  the  BLasso  algorithm.  A  small  step  size  constant  e  >  0 
and  a  small  tolerance  parameter  ^  >  0  are  needed  to  run  the  algorithm.  The  step  size  e 
controls  how  well  BLasso  approximates  the  Lasso  path.  The  tolerance  ^  controls  how  large 
a  descend  needs  to  be  made  for  a  backward  step  and  is  set  to  be  smaller  than  e  to  have  a 
good  approximation.  When  the  number  of  base  learners  is  finite,  the  tolerance  can  be  set 
to  0  or  a  small  number  to  accommodate  the  numerical  precision  of  the  computer. 

Boosted  Lasso  (BLasso) 

Step  1  (initialization).  Given  data  Zi  =  (V,  A^),  i  =  1,  take  an  initial  forward  step 


(j.s;)  =  arg  rniri 

j.,s—±e 
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Then  calculate  the  initial  regularization  parameter 

.  n  n 

A°  =  -(^L(Z,;0)-^L(Z,;/3°)) 

Set  the  active  index  set  l\  =  {j}.  Set  t  =  0. 

Step  2  (Backward  and  Forward  steps).  Find  the  “backward”  step  that  leads  to  the  minimal 
empirical  loss 

n 

j  =  arg  min  ^  L{Zi;  /3*  +  Sjlj)  where  Sj  =  — sign(/3*)e.  (10) 

^  1  —  1 

Take  the  step  if  it  leads  to  a  decrease  of  moderate  size  (  in  the  Lasso  loss,  otherwise  force 
a  forward  step  (as  (7),  (8)  in  FSF)  and  relax  A  if  necessary.  That  is  if  F(/3*  +  s--ly,  A*)  — 
F(/3‘,  A‘)  <  — then 

/3‘+i  =  + 

A*+i  =  A*, 

IT  =  IaKj},  if/3]+'=0. 

Otherwise, 

n 

(j,s)  =  arg  min  Vl(Z,;/3*  +  s1j),  (11) 

+  (12) 

-  n  n 

V+i  =  min[V,-(^L(Z,;/3‘)-^L(Z,;/3‘+i))], 

IT  =  Ia'SIj}- 

Step  3  (iteration).  Increase  t  by  one  and  repeat  Step  2  and  3.  Stop  when  A*  <  0. 

We  will  discuss  forward  and  backward  steps  in  depth  in  the  next  section.  Immediately, 
the  following  properties  can  be  proved  for  BLasso  (see  Appendix  for  the  proof). 

Lemma  1. 

(a)  For  any  A  >  0,  if  there  exist  j  and  s  with  |s|  =  e  such  that  F(slj;  A)  <  F(0;  A),  we 
have  A°  <  A. 

(b)  For  any  t  such  that  =  A*,  we  have  F(/3‘+^;  A*)  <  F(/3*;  A*). 

(c)  For  ^  =  0  and  any  t  such  that  <  A*,  we  have  F(/3*;  A*)  <  F(/3‘  ±  el^;  A*)  for  every 
j  and  11/3*+^ 111  =  ||/3*||i  +  e. 

Lemma  1  (a)  guarantees  that  it  is  safe  for  BLasso  to  start  with  an  initial  Aq  which  is 
the  largest  A  that  would  allow  an  e  step  away  from  0  (i.e.,  larger  A’s  correspond  to  /3a  =  0). 
Lemma  1  (b)  says  that  for  each  value  of  A,  BLasso  performs  coordinate  descent  until  there 
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is  no  descent  step.  Then,  by  Lemma  1  (c),  the  value  of  A  is  reduced  and  a  forward  step  is 
forced.  Since  the  minimizers  corresponding  to  adjacent  A’s  are  usually  close,  this  procedure 
moves  from  one  solution  to  the  next  within  a  few  steps  and  effectively  approximates  the 
Lasso  path.  Actually,  we  have  a  convergence  result  for  BLasso: 

Theorem  1.  For  finite  number  of  base  learners  and  ^  =  0,  if  L{Z;  (3)  is  strictly  convex  and 
continuously  differentiable  in  (3,  then  as  e  ^  0,  the  BLasso  path  converges  to  the  Lasso  path. 

Many  popular  loss  functions,  e.g.  squared  loss,  logistic  loss  and  negative  log-likelihood 
functions  of  exponential  families,  are  convex  and  continuously  differentiable.  Other  func¬ 
tions  like  the  hinge  loss  (SVM)  is  continuous  and  convex  but  not  differentiable.  In  fact, 
BLasso  does  not  use  any  gradient  or  higher  order  derivatives  but  only  the  differences  of 
the  loss  function.  When  the  function  is  differentiable,  because  of  the  small  fixed  step  size, 
the  differences  are  good  approximations  to  the  gradients,  but  they  are  simpler  to  compute. 
When  the  loss  function  is  not  differentiable,  BLasso  still  runs  because  it  does  not  rely  on 
derivatives.  It  is  theoretically  possible  that  BLasso’s  coordinate  descent  strategy  gets  stuck 
at  nonstationary  points  for  functions  like  the  hinge  loss.  However,  as  illustrated  in  our  third 
experiment,  BLasso  may  still  work  for  1-norm  SVM  empirically. 

Note  that  Theorem  1  does  not  cover  nonparametric  learning  problems  with  an  infinite 
number  of  base  learners.  In  fact,  for  problems  with  a  large  or  an  infinite  number  of  base 
learners,  the  minimization  in  (11)  can  be  carried  out  approximately  by  functional  gradient 
descent,  but  a  tolerance  ^  >  0  needs  to  be  chosen  to  avoid  oscillation  between  forward  and 
backward  steps  caused  by  slow  descending.  We  will  return  to  this  topic  in  Section  6. 

4.  The  Backward  Boosting  Step 

We  now  explain  the  motivation  and  working  mechanic  of  BLasso.  Observe  that  FSF  only 
uses  “forward”  steps,  that  is,  it  only  takes  steps  that  lead  to  a  direct  reduction  of  the 
empirical  loss.  Comparing  to  classical  model  selection  methods  like  Forward  Selection  and 
Backward  Elimination,  Growing  and  Pruning  of  a  classification  tree,  a  “backward”  coun¬ 
terpart  is  missing.  Without  the  backward  step,  FSF  can  be  too  greedy  and  deviate  from 
the  Lasso  path  in  some  cases  (cf.  Figure  1).  This  backward  step  naturally  arises  in  BLasso 
because  of  our  coordinate  descent  view  of  the  minimization  of  the  Lasso  loss. 

For  a  given  /?  yf  0  and  A  >  0,  consider  the  impact  of  a  small  e  >  0  change  of  Pj  to  the 
Lasso  loss  F(/3;  A).  For  an  |s|  =  e, 

n  n 

AjT{Z-,f3)  =  C^L{Z,-,(3  +  sl,)-Y,HZi-,(3))  +  X{T{(3  +  sl,)-T{(3)) 

n 

:=  A,(^L(Z,;/3))  +  AA,T(/3).  (13) 

Since  T{P)  is  simply  the  Li  norm  of  /?,  AT{P)  reduces  to  a  simple  form: 

A,T(/3)  =  \\p  +  sl,\\,-m,  =  \f3j+s\-\f3,\ 

=  sign+(/3y,s)  •  e  (14) 

where  sign’*' (/3j ,  s)  =  1  if  s/Sj  >  0  or  (3j  =  0,  sign'*' (/3j ,  s)  =  —  1  if  <  0  and  sign'*' (/3j ,  s)  = 
0  if  s  =  0. 
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Equation  (14)  shows  that  an  e  step’s  impact  on  the  penalty  is  a  fixed  e  for  any  j.  Only 
the  sign  of  the  impact  may  vary.  Suppose  given  a  j3,  the  “forward”  steps  for  different 
j  have  impacts  on  the  penalty  of  the  same  sign,  then  AjT  is  a  constant  in  (13)  for  all 
j.  Thus,  minimizing  the  Lasso  loss  using  fixed-size  steps  is  equivalent  to  minimizing  the 
empirical  loss  directly.  At  the  “early”  stages  of  FSF,  all  forward  steps  are  parting  from 
zero,  therefore  all  the  signs  of  the  “forward”  steps’  impact  on  penalty  are  positive.  As  the 
algorithm  proceeds  into  “later”  stages,  some  of  the  signs  may  change  into  negative  and 
minimizing  the  empirical  loss  is  no  longer  equivalent  to  minimizing  the  Lasso  loss.  Hence 
in  the  beginning,  FSF  carries  out  a  steepest  descent  algorithm  that  minimizes  the  Lasso 
loss  and  follows  Lasso’s  regularization  path,  but  as  it  goes  into  later  stages,  the  equivalence 
is  broken  and  they  part  ways.  In  fact,  except  for  special  cases  like  orthogonal  designed 
covariates,  FSF  usually  go  into  “later”  stages,  then  the  signs  of  impacts  on  penalty  on 
some  directions  can  change  from  positive  to  negative.  These  directions  then  reduce  the 
empirical  loss  and  penalty  simultaneously  therefore  they  should  be  preferred  over  other 
directions.  Moreover,  there  can  also  be  occasions  where  a  step  goes  “backward”  to  reduce 
the  penalty  with  a  small  sacrifice  in  empirical  loss.  In  general,  to  minimize  the  Lasso  loss, 
one  needs  to  go  “back  and  forth”  to  trade  off  the  penalty  with  empirical  loss  for  different 
regularization  parameters.  We  call  a  direction  that  leads  to  reduction  of  the  penalty  a 
“backward”  direction  and  define  a  backward  step  as  the  following: 

For  a  given  (3,  a  backward  step  is  such  that: 

A/3  =  Sjlj,  subject  to  $j  yf  0,  sign(s)  =  — sign(/3j)  and  |s|  =  e. 

Making  such  a  step  will  reduce  the  penalty  by  a  fixed  amount  A  •  e,  but  its  impact  on 
the  empirical  loss  can  be  different,  therefore  as  in  (10)  we  want: 

n 

j  =  argmin  >  L{Zi-,  $  +  Sjlj)  subject  to  $j  yf  0  and  Sj  =  — sign(/3j)e, 

1—1 

i.e.  j  is  picked  such  that  the  empirical  loss  after  making  the  step  is  as  small  as  possible. 

While  forward  steps  try  to  reduce  the  Lasso  loss  through  minimizing  the  empirical  loss, 
the  backward  steps  try  to  reduce  the  Lasso  loss  through  minimizing  the  Lasso  penalty. 
Although  rare,  it  is  possible  to  have  a  step  reduce  both  the  empirical  loss  and  the  Lasso 
penalty.  It  thus  works  both  as  a  forward  step  and  a  backward  step.  We  do  not  distinguish 
such  steps  as  they  do  not  create  any  confusions. 

In  summary,  by  allowing  the  backward  steps,  we  are  able  to  work  with  the  Lasso  loss 
directly  and  take  backward  steps  to  correct  earlier  forward  steps  that  are  too  greedy. 

5.  Least  Squares  Problem 

For  the  most  common  particular  case  -  least  squares  regression,  the  forward  and  backward 
steps  in  BLasso  become  very  simple  and  more  intuitive.  To  see  this,  we  write  out  the 
empirical  loss  function  L{Zi]  /3)  in  its  form, 

n  n  n  n 

Y,  HZp,  /3)  =  ^(r,  -  XiPf  =  ^(y,  -  =  Y  vl  (i5) 

i—1  i—1  i—1  i—1 

where  Y  =  (Yi,  ...,y„)^  are  the  “fitted  values”  and  rj  =  {rji,  ...,r]n)'^  are  the  “residuals”. 
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Recall  that  in  a  penalized  regression  setup  Xi  =  {Xu,  ...,Xim),  every  covariates  X^  = 
{Xij,...,Xnj)'^  is  normalized,  i.e.  =  1  ^^d  J27=i^v  =  0-  For  a  given 

(3  =  (Pi,  ...PmY' ,  the  impact  of  a  step  s  of  size  |s|  =  e  along  Pj  on  the  empirical  loss  function 
can  be  written  as: 


A(^L(Z,;/?)) 

2=1 


-  XPP  +  si,))"  -  (R.  -  Xipf] 

2=1 


-  sXdjf 

2=1 


-2s{tj  ■  X^)  +  . 


=  '^{-‘^srj.Xy  +  s^Xfj) 
2=1 


(16) 


The  last  line  of  these  equations  delivers  a  strong  message  -  in  least  squares  regression, 
given  the  step  size,  the  impact  on  the  empirical  loss  function  is  solely  determined  by  the 
inner-product  (correlation)  between  the  fitted  residuals  and  each  coordinate.  Specifically,  it 
is  proportional  to  the  negative  inner-product  between  the  fitted  residuals  and  the  covariate 
plus  the  step  size  squared.  Therefore  coordinate  steepest  descent  with  a  fixed  step  size  on 
the  empirical  loss  function  is  equivalent  to  finding  the  covariate  that  is  best  correlated  with 
the  fitted  residuals  at  each  step,  then  proceed  along  the  same  direction.  This  is  in  principle 
the  same  as  FSF. 

Translate  this  for  the  forward  step  where  originally 

71 

(h  s;)  =  arg  mm  V  L(Z,;  P  +  si,), 

j,s=±e  '  ^ 


we  get 


j  =  arg  max  \rj  ■  X^  \  and  s  =  sign(?7  •  X^)e, 
j 


(17) 


which  coincides  exactly  with  the  stagewise  procedure  described  in  Efron  (2002)  and  is  in 
general  the  same  principle  as  L2  Boosting,  i.e.  recursively  refitting  the  regression  residuals 
along  the  most  correlated  direction  except  the  difference  in  step  size  choice  (Friedman  2001, 
Buhlmann  and  Yu  2003).  Also,  under  this  simplification,  a  backward  step  becomes 


j  =  argmin(— s(ry  ■ 
j 


X^))  subject  to  Pj  yf  0  and  s,  =  — sign(/3,)e. 


(18) 


Clearly  that  both  forward  and  backward  steps  are  based  only  on  the  correlations  be¬ 
tween  fitted  residuals  and  the  covariates.  It  follows  that  BLasso  in  the  L2  case  reduces  to 
finding  the  best  direction  in  both  forward  and  backward  directions  by  examining  the  inner- 
products,  and  then  deciding  whether  to  go  forward  or  backward  based  on  the  regularization 
parameter.  This  not  only  simplifies  the  minimization  procedure  but  also  significantly  re¬ 
duces  the  computation  complexity  for  large  datasets  since  the  inner-product  between  rjt  and 
Xj  can  be  updated  by 

(r;*+i)%  =  (r;*  -  sA.,)%-  =  (rfyX,  -  sX'.^X,.  (19) 

which  takes  only  one  operation  if  X'^Xj  is  precalculated. 

Therefore,  when  the  number  of  base  learners  is  small,  based  on  precalculated  X'X  and 
Y'X,  BLasso  without  the  original  data  by  using  (19)  which  makes  its  computation  com¬ 
plexity  independent  from  the  number  of  observations.  When  the  number  of  base  learners 
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is  large  (or  infinite),  this  strategy  because  inefficient  (or  unsuitable).  The  forward  step  is 
then  done  by  a  sub-optimization  procedure  e.g.  fitting  smoothing  splines.  For  the  back¬ 
ward  step,  only  inner-products  between  base  learners  with  nonzero  coefficients  need  to  be 
calculated  once.  Then  the  inner  products  between  these  base  learners  and  residuals  can  be 
updated  by  (19).  This  makes  the  backward  steps’  computation  complexity  proportional  to 
the  number  of  base  learners  that  are  already  chosen  instead  of  the  number  of  all  possible 
base  learners.  Therefore  BLasso  works  efficiently  not  only  for  cases  with  large  sample  size 
but  also  for  cases  where  a  class  of  a  large  or  an  infinite  number  of  possible  base  learners  is 
given  but  we  only  look  for  a  sparse  model. 

As  mentioned  earlier,  there  are  already  established  algorithms  for  solving  the  least  square 
Lasso  problem  e.g.  Shooting  algorithm  (Fu  1998),  and  LARS  (Efron  et  al.  2004).  These 
algorithms  are  efficient  for  giving  the  exact  Lasso  path  when  the  number  of  base  learners  is 
small  but  neither  algorithm  is  adequate  for  nonparametric  learning  problems  with  a  large  or 
an  infinite  number  of  base  learners.  Also,  given  the  entire  Lasso  path,  one  solution  needs  to 
be  selected.  Previous  algorithms  suggest  using  (generalized)  cross  validation  procedures  that 
minimized  a  (generalized)  cross  validation  loss  over  a  grid  of  A.  BLasso,  as  these  strategies 
still  apply,  can  also  take  advantage  of  early  stopping  as  in  the  original  boosting  literature 
and  potentially  saves  computation  from  further  unnecessary  iterations.  Particularly,  in  cases 
where  the  Ordinary  Least  Square  (OLS)  model  performs  well,  BLasso  can  be  modified 
to  start  from  the  OLS  model,  go  backward  and  stop  in  a  few  iterations.  Overall,  for 
least  squares  problem,  BLasso  is  an  algorithm  with  Boosting’s  computational  efficiency  and 
flexibility  but  also  produces  Lasso  solutions  as  LARS  and  Shooting  algorithms. 

6.  Boosted  Lasso  for  Nonparametric  Learning 

As  we  discussed  for  the  least  squares  problem.  Boosted  Lasso  can  be  used  to  deal  with 
nonparametric  learning  problems  with  a  large  or  an  infinite  number  of  base  learners.  This 
is  because  Boosted  Lasso  has  two  types  of  steps:  forward  and  backward,  for  both  of  which 
a  large  or  an  infinite  number  of  base  learners  does  not  create  a  problem. 

The  forward  steps  are  similar  to  the  steps  in  Boosting  which  can  handle  a  large  or 
an  infinite  number  of  base  learners.  That  is,  the  forward  step,  as  in  Boosting,  is  a  sub- 
optimization  problem  by  itself  and  Boosting’s  functional  gradient  descend  strategy  applies 
and  this  functional  gradient  strategy  works  even  with  a  large  or  an  infinite  number  of  base 
learners.  For  example,  in  the  case  of  classification  with  trees,  one  can  use  the  classification 
margin  or  the  logistic  loss  function  as  the  loss  function  and  use  a  reweighting  procedure  to 
find  the  appropriate  tree  at  each  step  (for  details  see  e.g.  Breiman  1998  and  Friedman  et 
al.  2000).  In  case  of  regression  with  the  loss  function,  the  minimization  as  in  (11)  is 
equivalent  to  refitting  the  residuals  as  we  described  in  the  last  section. 

The  backward  steps  consider  only  those  base  learners  already  in  the  model  and  the 
number  of  these  learners  is  always  finite.  This  is  because  for  an  iterative  procedure  like 
Boosted  Lasso,  we  usually  stop  early  to  avoid  overfitting  and  to  get  a  sparse  model.  This 
results  in  a  finite  number  of  base  learners  altogether.  And  even  if  the  algorithm  is  kept 
running,  it  usually  reaches  a  close-to-perfect  fit  without  too  many  iterations  (i.e.  the  “strong 
learnability”  in  Schapire  1990  ).  Therefore,  the  backward  step’s  computation  complexity  is 
limited  because  it  only  involves  base  learners  that  are  already  included  from  previous  steps. 

There  is,  however,  a  difference  in  the  BLasso  algorithm  between  the  case  with  a  small 
number  of  base  learners  and  the  one  with  a  large  or  an  infinite  number  of  base  learners.  For 
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the  finite  case,  since  BLasso  requires  a  backward  step  to  be  strictly  descending  and  relax 
A  whenever  no  descending  step  is  available,  therefore  BLasso  never  reach  the  same  solution 
more  than  once  and  the  tolerance  constant  ^  can  be  set  to  0  or  a  very  small  number 
to  accommodate  the  program’s  numerical  accuracy.  In  the  nonparametric  learning  case, 
oscillation  could  occur  when  BLasso  keeps  going  back  and  forth  in  different  directions  but 
only  improving  the  penalized  loss  function  by  a  diminishing  amount,  therefore  a  positive 
tolerance  ^  is  mandatory.  Since  a  step’s  impact  on  the  penalized  loss  function  is  on  the 
order  of  e,  we  suggest  using  ^  =  c  •  e  where  c  G  (0, 1)  is  a  small  constant. 


7.  Generalized  Boosted  Lasso 

As  stated  earlier,  BLasso  not  only  works  for  general  convex  loss  functions,  but  also  extends  to 
convex  penalties  other  than  the  Li  penalty.  For  the  Lasso  problem,  BLasso  does  a  fixed  step 
size  coordinate  descent  to  minimize  the  penalized  loss.  Since  the  penalty  has  the  special  Li 
norm  and  (14)  holds,  a  step’s  impact  on  the  penalty  has  a  fixed  size  e  with  either  a  positive  or 
a  negative  sign,  and  the  coordinate  descent  takes  form  of  “backward”  and  “forward”  steps. 
This  reduces  the  minimization  of  the  penalized  loss  function  to  unregularized  minimizations 
of  the  loss  function  as  in  (11)  and  (10).  For  general  convex  penalties,  since  a  step  on  different 
coordinates  does  not  necessarily  have  the  same  impact  on  the  penalty,  one  is  forced  to  work 
with  the  penalized  function  directly.  Assume  T(f3):  i?™  —^Risa,  convex  penalty  function. 
We  now  describe  the  Generalized  Boosted  Lasso  algorithm: 

Generalized  Boosted  Lasso 


Step  1  (initialization).  Given  data  Zi  =  {Yi,Xi),  i  =  l,...,n  and  a  fixed  small  step  size 
e  >  0  and  a  small  tolerance  parameter  ^  >  0,  take  an  initial  forward  step 

n 

{j,s-A  =  arg  mm  Vl(Z,;s1j),  /3°  =  s-lp 

i—\ 


Then  calculate  the  corresponding  regularization  parameter 

r(/30)  -  r(o) 


Set  t  =  0. 

Step  2  (steepest  descent  on  Lasso  loss).  Find  the  steepest  coordinate  descent  direction  on 
the  penalized  loss 

(j,  A)  =  arg  min  F(/3*  +  sl^;  A*). 

j,s—±e 

Update  (3  if  it  reduces  Lasso  loss;  otherwise  force  /3  to  minimize  L  and  recalculate  the 
regularization  parameter.  That  is  if  F(/3‘  +  Sjlj;  A*)  —  F(/3*,  A*)  <  — then 


pt+i  )t+i 


n 


arg  min 
i.l«l=e 


L{Zi-,  /3*  +  slj). 


Otherwise 


1 2  Peng  Zhao  et  al. 


/3*+i  =  P^  +  s-l., 

T(/3‘+i)  -  T(/3*) 

3  (iteration).  Increase  t  by  one  and  repeat  Step  2  and  3.  Stop  when  A*  <  0. 


In  the  Generalized  Boosted  Lasso  algorithm,  explicit  “forward”  or  “backward”  steps  are 
no  longer  seen.  However,  the  mechanic  remains  the  same  -  minimize  the  penalized  loss 
function  for  each  A,  relax  the  regularization  by  reducing  A  through  a  “forward”  step  when 
the  minimum  of  the  loss  function  for  the  current  A  is  reached. 

Another  method  that  works  in  a  similar  fashion  is  the  Barrier  Method  proposed  by 
Fiacco  and  McCormick  (1968)  (cf.  Boyd  and  Vandenberghe,  2004).  It  is  an  advanced 
optimization  method  for  constrained  convex  minimization  problems  which  solves  a  sequence 
of  unconstrained  minimization  problems  that  approach  the  constrained  problem.  We  note 
that  BLasso  also  solves  a  sequence  of  optimization  problems  indexed  by  A.  Both  methods 
use  the  last  solution  found  as  the  starting  point  for  the  next  unconstrained  minimization 
problem.  However,  there  exists  a  fundamental  difference:  the  Barrier  Method  targets  to  get 
to  the  end  of  the  path  (A  =  0)  so  it  jumps  quickly  from  solution  to  solution  according  to  a 
geometric  schedule.  In  comparison,  BLasso  is  designed  to  produce  a  well  sampled  solution 
path.  It  therefore  places  solution  points  evenly  on  the  path  (more  discussions  will  be  offered 
when  we  compare  BLasso  with  the  algorithm  in  Rosset  (2004)  in  the  discussion  section). 
This  subtle  connection  between  BLasso  and  the  extensively  studied  Barrier  Method  provides 
opportunities  to  explore  further  convergence  properties  of  BLasso  and  to  study  the  interplay 
between  optimization  and  statistical  estimation  theories. 


8.  Experiments 

In  this  section,  three  experiments  are  carried  out  to  implement  BLasso.  The  first  experiment 
runs  BLasso  on  the  diabetes  dataset  (cf.  Efron  et  al.  2004)  with  an  added  artificial  covariate 
variable  under  the  classical  Lasso  setting,  i.e.  L2  regression  with  an  Li  penalty.  This  added 
covariate  is  strongly  correlated  with  a  couple  of  the  original  covariates.  In  this  case,  BLasso 
is  seen  to  produce  a  path  almost  exactly  the  same  as  the  Lasso  path,  while  FSF’s  path  parts 
drastically  from  Lasso’s  due  to  the  added  strongly  correlated  covariate. 

In  the  second  experiment,  we  use  the  diabetes  dataset  for  L2  regression  without  the 
added  covariate  but  with  several  different  penalties,  therefore  Generalized  BLasso  is  used. 
The  penalties  are  bridge  penalties  (cf.  Fu  1998)  with  different  bridge  parameters  7=1, 
7  =  1.1,  7  =  2,  7  =  4  and  7  =  00.  Generalized  BLasso  produced  the  solution  paths 
successfully  for  all  cases. 

The  last  experiment  is  on  classification  where  we  use  simulated  data  to  illustrate  BLasso’s 
solving  regularized  classification  problem  under  the  1-norm  SVM  setting.  Here,  since  the 
loss  function  is  not  continuously  differentiable,  it  is  theoretically  possible  that  BLasso  does 
not  converge  to  the  true  regularization  path.  However,  as  we  illustrate,  BLasso  runs  without 
a  problem  and  produces  reasonable  solutions. 


8. 1.  L2  Regression  with  Li  Penaity  (Ciassicai  Lasso) 

The  dataset  used  in  this  experiment  is  from  a  Diabetes  study  where  442  diabetes  patients 
were  measured  on  10  baseline  variables.  A  prediction  model  was  desired  for  the  response 
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variable,  a  quantitative  measure  of  disease  progression  one  year  after  baseline.  One  addi¬ 
tional  variable,  =  —X'^  +  X®  -|-  -|-  e  where  e  is  i.i.d.  Gaussian  noise  (mean  zero  and 

Var(e)  =  1/442),  is  added  to  make  the  difference  between  FSF  and  Lasso  solutions  more 
visible.  This  added  covariate  is  strongly  correlated  with  X^,  with  correlations  as  follows  (in 
the  order  of  X\  X^, X^°)  : 

(0.25  ,  0.24  ,  0.47  ,  0.39  ,  0.48  ,  0.38  ,  -0.58  ,  0.76  ,  0.94  ,  0.47). 

The  classical  Lasso  -  L2  regression  with  Li  penalty  is  used  for  this  purpose.  Let 
X^ ,  X^ , X"^  be  n— vectors  representing  the  covariates  and  Y  the  vector  of  responses 
for  the  n  cases,  m  =  11  and  n  =  442  in  this  study.  Location  and  scale  transformations 
are  made  so  that  all  covariates  are  standardized  to  have  mean  0  and  unit  length,  and  the 
response  has  mean  zero. 

The  penalized  loss  function  has  the  form: 


r(/3;  \)=Y,{Y,-X,(3f 


(20) 


2=1 


Lasso 


BLasso 


FSF 


t  =  E  l/3t 


i  =  E  l/3t 


t  =  E  l/3j 


Figure  1.  Estimates  of  regression  coefficients  (3j,  j=l,2,...10,ll  for  the  diabetes  data.  Left 
Panel:  Lasso  solutions  (produced  using  simplex  search  method  on  the  penalized  empirical 
loss  function  for  each  A)  as  a  function  of  t  =  ||/3||i.  Middle  Panel:  BLasso  solutions,  which 
can  be  seen  indistinguishable  to  the  Lasso  solutions.  Right  Panel:  FSF  solutions,  which  are 
different  from  Lasso  and  BLasso. 


The  middle  panel  of  Figure  1  shows  the  coefficient  plot  for  BLasso  applied  to  the  modified 
diabetes  data.  Left  (Lasso)  and  Middle  (Middle)  panels  are  indistinguishable  from  each 
other.  Both  FSF  and  BLasso  pick  up  the  added  artificial  and  strongly  correlated  X^^  (the 
solid  line)  in  the  earlier  stages,  but  due  to  the  greedy  nature  of  FSF,  it  is  not  able  to  remove 
X^^  in  the  later  stages  thus  every  parameter  estimate  is  affected  leading  to  significantly 
different  solutions  from  Lasso. 

The  BLasso  solutions  were  built  up  in  8700  steps  (making  the  step  size  e  =  0.5  small  so 
that  the  coefficient  paths  are  smooth),  840  of  which  were  backward  steps.  In  comparison, 
FSF  took  7300  pure  forward  steps.  BLasso’s  backward  steps  mainly  concentrate  around  the 
steps  where  FSF  and  BLasso  tend  to  differ. 
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8.2.  L2  Regression  with  Penaities  (the  Bridge  Regression) 

To  demonstrate  the  Generalized  BLasso  algorithm  or  different  penalties.  We  use  the  Bridge 
Regression  setting  with  the  diabetes  dataset  (without  the  added  covariate  in  the  first  exper¬ 
iment).  The  Bridge  Regression  method  was  first  proposed  by  Frank  and  Friedman  (1993) 
and  later  more  carefully  discussed  and  implemented  by  Fu  (1998).  It  is  a  generalization 
of  the  ridge  regression  (L2  penalty)  and  Lasso  (Li  penalty).  It  considers  a  linear  (L2) 
regression  problem  with  Lj  penalty  for  7  >  1  (to  maintain  the  convexity  of  the  penalty 
function). 

The  penalized  loss  function  has  the  form: 

n 

F(/3;A)  =  ^(y,-W/3)"  +  A||/3|U  (21) 

i=l 

where  7  is  the  bridge  parameter.  The  data  used  in  this  experiment  are  centered  and  rescaled 
as  in  the  first  experiment. 


7  =  00 


100  400  700 

1  — ~ 

0 - 

-1 

-1  0  1 


Figure  2.  Upper  Panel:  Solution  paths  produced  by  BLasso  for  different  bridge  parameters. 
,  from  left  to  right:  Lasso  (7  =  1),  near-Lasso  (7  =  1.1),  Ridge  (7  =  2),  over-Ridge  (7  =  4), 
max  (7  =  00).  The  F-axis  is  the  parameter  estimate  and  has  the  range  [—800,  800].  The  X- 
axis  for  each  of  the  left  4  plots  is  ^  ■  \(3i\,  the  one  for  the  5th  plot  is  max(|/3i|)  because  \(}i\ 
is  unsuitable.  Lower  Panel:  The  corresponding  penalty  equal  contours  for  dj  +  =  1. 

Generalized  BLasso  successfully  produced  the  paths  for  all  5  cases  which  are  verified  by 
pointwise  minimization  using  simplex  method  (7  =  1,  7  =  1.1,  7  =  4  and  7  =  max)  or 
close  form  solutions  (7  =  2).  It  is  interesting  to  notice  the  phase  transition  from  the  near- 
Lasso  to  the  Lasso  as  the  solution  paths  are  similar  but  only  Lasso  has  sparsity.  Also,  as  7 
grows  larger,  estimates  for  different  (3j  tend  to  have  more  similar  sizes  and  in  the  extreme 
7  =  00  there  is  a  “branching”  phenomenon  -  the  estimates  stay  together  in  the  beginning 
and  branch  out  into  different  directions  as  the  path  progresses.  Since  no  differentiation 
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is  carried  out  or  assumed,  Generalized  BLasso  produced  all  of  these  significantly  different 
solution  paths  by  simply  plugging  in  different  penalty  functions  in  the  algorithm. 


8.3.  Classification  with  1-norm  SVM  (Hinge  Loss) 

To  demonstrate  the  Generalized  BLasso  algorithm  for  a  nondifferentiable  loss  function,  we 
now  look  at  binary  classification.  As  in  Zhu  et  al.  (2003),  we  generate  50  training  data 
points  in  each  of  two  classes.  The  first  class  has  two  standard  normal  independent  inputs 
and  and  class  label  Y  =  —1.  The  second  class  also  has  two  standard  normal  independent 
inputs,  but  conditioned  on  4.5  <  (X^)'^  +  (X^)^  <  8  and  has  class  label  Y  =  1.  We  wish 
to  find  a  classification  rule  from  the  training  data,  so  that  when  given  a  new  input,  we  can 
assign  a  class  Y  from  {1,-1}  to  it. 

1-norm  SVM  (Zhu  et  al.  2003)  is  used  to  estimate  /3: 


n  m  5 

(/3o,/3)  =  argmin^(l  -  Yi{f3o  +  ^ /3jAj(Ai)))+  +  |/3j|  (22) 

i=l  j=l  j=l 

where  hi  G  D  are  basis  functions  and  A  is  the  regularization  parameter.  The  dictionary 
of  basis  functions  is  D  =  {\/2X^,\/2X^,\/2X^X'^,{X^)‘^,{X'^)'^}.  Notice  that  (3o  is  left 
unregularized  so  the  penalty  function  is  not  the  Li  penalty. 

The  fitted  model  is 

m 

f(x)  =  j3o  +  '^Pjhj{x), 
i=i 

and  the  classification  rule  is  given  by  sign(/(a:)). 
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Figure  3.  Estimates  of  1-norm  SVM  coefficients  /3j,  j=l,2 
classification  data.  Left  Panel  BLasso  solutions  as  a  function  of  t  =  J2j= 
Scatter  plot  of  the  data  points  with  labels:  ’-I-’  for  y  =  —1;  ’o’  for  y  =  1. 
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Since  the  loss  function  is  not  differentiable  and  the  penalty  function  is  not  the  original 
Lasso  penalty,  we  do  not  have  a  theoretical  guarantee  that  BLasso  works.  Nonetheless 
the  solution  path  produced  by  Generalized  BLasso  has  the  same  sparsity  and  piecewise 
linearality  as  the  1-norm  SVM  solutions  shown  in  Zhu  et  al.  (2003).  It  takes  Generalized 
BLasso  490  iterations  to  generate  the  solutions.  The  covariates  enter  the  regression  equation 
sequentially  as  t  increase,  in  the  following  order:  the  two  quadratic  terms  first,  followed  by 
the  interaction  term  then  the  two  linear  terms.  As  1-norm  SVM  in  Zhu  et  al.  (2003),  BLasso 
correctly  picked  up  the  quadratic  terms  early.  The  interaction  term  and  linear  terms  that 
are  not  in  the  true  model  comes  up  much  later.  In  other  words,  BLasso  results  are  in  good 
agreement  with  Zhu  et  al.’s  1-norm  SVM  results  and  we  regard  this  as  a  confirmation  for 
BLasso’s  effectiveness  in  this  nondifferentiable  example. 

9.  Discussion  and  Conciuding  Remarks 

As  seen  from  the  experiments,  BLasso  is  effective  for  solving  the  Lasso  problem  and  general 
convex  penalized  loss  minimization  problems.  One  practical  issue  left  undiscussed  is  the 
choice  of  the  step  size  e.  In  general,  BLasso  takes  0(l/e)  steps  to  produce  the  whole  path. 
Depending  on  the  actual  loss  function,  base  learners  and  minimization  method  used  in 
each  step,  the  actual  computation  complexity  varies.  Although  choosing  a  smaller  step  size 
gives  smoother  solution  path  and  more  accurate  estimates,  we  observe  that  the  the  actual 
coefficient  estimates  are  pretty  accurate  even  for  relatively  large  step  sizes. 


Figure  4.  Estimates  of  regression  coefficients  $3  for  the  diabetes  data.  Solutions  are 
plotted  as  functions  of  A.  Dotted  Line:  Estimates  using  step  size  e  =  0.05.  Solid  Line: 
Estimates  using  step  size  e  =  10.  Dash-dot  Line:  Estimates  using  step  size  e  =  50. 

For  the  diabetes  data,  using  a  small  step  size  e  =  0.05,  the  solution  path  can  not  be 
distinguished  from  the  exact  regularization  path.  However,  even  when  the  step  size  is  as 
large  as  e  =  10  and  e  =  50,  the  solutions  are  still  good  approximations. 

BLasso  has  only  one  step  size  parameter.  This  parameter  controls  both  how  close 
BLasso  approximates  the  minimization  coefficients  for  each  A  and  how  close  two  adjacent 
A  on  the  regularization  path  are  placed.  As  can  be  seen  from  Figure  4,  a  smaller  step  size 
leads  to  a  closer  approximation  to  the  solutions  and  also  finer  grids  for  A.  We  argue  that. 
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if  A  is  sampled  on  a  coarse  grid  we  should  not  spend  computational  power  on  finding  a 
much  more  accurate  approximation  of  the  coefficients  for  each  A.  Instead,  the  available 
computational  power  spent  on  these  two  coupled  tasks  should  be  balanced.  BLasso’s  1- 
parameter  setup  automatically  balances  these  two  aspects  of  the  approximation  which  is 
graphically  expressed  by  the  staircase  shape  of  the  solution  paths. 

Another  algorithm  similar  to  Generalized  BLasso  is  developed  independently  by  Rosset 
(2004).  There,  starting  from  A  =  0,  a  solution  is  generated  by  taking  a  small  Newton- 
Raphson  step  for  each  A,  then  A  is  increased  by  a  fixed  amount.  The  algorithm  assumes 
twice-differentiability  of  both  loss  function  and  penalty  function  and  involves  calculation  of 
the  Hessian  matrix.  It  works  in  a  very  similar  fashion  as  the  Barrier  Method  with  a  linear 
update  schedule  for  A  (Barrier  Method  uses  a  geometric  schedule).  The  Barrier  Method 
usually  uses  a  second  order  optimization  method  for  each  A  as  well.  In  comparison,  BLasso 
uses  only  the  differences  of  the  loss  function  and  involves  only  basic  operations.  BLasso’s 
step  size  is  defined  in  the  original  parameter  space  which  makes  the  solutions  evenly  spread 
in  /3’s  space  rather  than  in  A.  In  general,  since  A  is  approximately  the  reciprocal  of  size 
of  the  penalty,  as  fitted  model  grows  larger  and  A  becomes  smaller,  changing  A  by  a  fixed 
amount  makes  the  algorithm  in  Rosset  (2004)  move  too  fast  in  the  /?  space.  On  the  other 
hand,  when  the  model  is  close  to  empty  and  the  penalty  function  is  very  small,  A  is  very 
large,  but  the  algorithm  still  uses  same  small  steps  thus  computation  is  spent  to  generate 
solutions  that  are  too  close  from  each  other. 

One  of  our  current  research  topics  is  to  apply  BLasso  in  an  online  or  time  series  setting. 
Since  BLasso  has  both  forward  and  backward  steps,  we  believe  that  an  adaptive  online 
learning  algorithm  can  be  devised  based  on  BLasso  so  that  it  goes  back  and  forth  to  track 
the  best  regularization  parameter  and  the  corresponding  model. 

We  end  with  a  summary  of  our  main  contributions: 


(a)  By  combining  both  forward  and  backward  steps,  a  Boosted  Lasso  (BLasso)  algorithm 
is  constructed  to  minimize  an  Li  penalized  convex  loss  function.  While  it  maintains 
the  simplicity  and  flexibility  of  Boosting  (or  Forward  Stagewise  Fitting)  algorithms, 
BLasso  efficiently  produces  the  Lasso  solutions  for  general  loss  functions  and  large 
classes  of  base  learners.  This  can  be  proven  rigorously  under  the  assumption  that  the 
loss  function  is  convex  and  continuously  differentiable. 

(b)  The  backward  steps  introduced  in  this  paper  are  critical  for  producing  the  Lasso  path. 
Without  them,  the  FSF  algorithm  can  be  too  greedy  and  in  general  does  not  produce 
Lasso  solutions,  especially  when  the  base  learners  are  strongly  correlated  as  in  cases 
where  the  number  of  base  learners  is  larger  than  the  number  of  observations. 

(c)  When  the  loss  function  is  squared  loss,  BLasso  takes  a  simpler  and  more  intuitive 
form  and  depends  only  on  the  inner  products  of  variables.  For  problems  with  large 
number  of  observations  and  fewer  base  learners,  after  initialization,  BLasso  does  not 
depend  on  the  number  of  observations.  For  nonparametric  learning  problems  where 
previous  methods  do  not  cover  due  to  a  large  or  an  infinite  number  of  base  learners, 
BLasso  remains  computationally  efficient  since  the  forward  steps  are  boosting  steps 
and  the  backward  steps  deaf  onfy  with  base  learners  that  have  already  been  included 
in  the  model. 

(d)  We  generalize  BLasso  to  deal  with  other  convex  penalties  (generalized  BLasso)  and 
show  a  connection  with  the  Barrier  Method  for  constrained  convex  optimization. 
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A.  Appendix:  Proofs 

First,  we  offer  a  proof  for  Lemma  1. 

Proof.  (Lemma  1) 

(a)  Suppose  3A,  j,  |s|  =  e  s.t.  r(elj;  A)  <  r(0;  A).  We  have 

n  n 

Y,  L{Z,;  0)-Y  ^  -  ^'TiO), 

i=l 

therefore 

^  n  n 

A  <  -{YL{Z^;0)-YL{Z.■,sl,)} 

1^1 

^  n  n 

<  -{^L(Z,;0)-  ruin  ^L(Zi;slj)} 

^  i=i  ^’l*l=%=i 

-  n  n 

=  -{^L(Z,;0)-^L(Z,;/3O)} 

i—1  i—1 

=  A°. 

(b)  Since  a  backward  step  is  only  taken  when  r(/3*+^;  A*)  <  r(/3*;  A*),  so  we  only  need  to 
consider  forward  steps.  When  a  forward  step  is  forced,  if  r(/3*+^;  A*)  >  r(/3*;  A*),  then 

n  n 

Y  L{Z,-,  /3‘)  -  ^  L(Z,;  /3‘+i)  <  A‘T(/3‘+1)  -  A‘T(/3‘), 

i=l 

therefore 

-  n  n 

=  -{E  -  E 

i—1  i—1 

which  contradicts  the  assumption. 

(c)  Since  <  A*  and  A  can  not  be  relaxed  by  a  backward  step,  we  immediately  have 
||/9*''’^||i  =  11/3* 111  +  £•  Then  from 

-  n  n 

=  -{E  /5*)  -  ^  L(Z,;  /3‘+i)} 

i—1  i—1 

we  get 

r(/3*;A*+i)  =r(/3*+i;A*+i). 

Plus  both  sides  by  A*  —  times  the  penalty  term,  and  recall  T(/3‘+^)  =  ||/3*''"^||i  > 
|/3‘||i=T(/3‘),  weget 

r(/3*;A‘)  <  r(/3*+i;A*) 

=  min  r(/3*  +  sl^;  A*) 
i,|s|=« 

<  r(/3‘±el,;A*) 


for  Vj.  This  completes  the  proof. 
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Theorem  1  claims  “the  BLasso  path  converges  to  the  Lasso  path”,  by  which  we  mean: 

(a)  As  e  ^  0,  for  Vt  s.t.  <  A*,  r(/3*;  A*)  —  min^g  r(/3;  A‘)  =  o(l). 

(b)  For  each  e  >  0,  it  takes  finite  steps  to  run  BLasso. 

Proof.  (Theorem  1) 


(a)  Since  L(Z;/3)  is  strictly  convex  in  (3,  so  r(/3;A)  is  strictly  convex  in  (3  for  VA.  So 
r(/3;  A)  has  unique  minimum  P*{X).  For  j,  that  (3*{X)  yf  0,  F  is  differentiable  in  that 
direction  and  dT  =  0,  for  the  j  that  f3*{X)  =  0,  the  right  and  left  partial  derivative 
of  F  w.r.t.  f3j  are  both  bigger  than  or  equal  to  0  and  at  least  one  of  them  is  strictly 
bigger  than  0.  Now,  recall  Lemma  1  says  BLasso  does  steepest  coordinate  descent 
with  fixed  step  size  e.  We  want  to  show  it  does  not  get  stuck  near  a  nonstationary 
point  as  e  ^  0. 

Consider  at  /3*,  we  look  for  the  steepest  descent  on  the  surface  of  a  polytope  defined 
by  /3  =  /3*  +  A/3  where  |jA/3||i  =  e,  i.e. 

min  AF(/3‘  +  A/3;  A)  subject  to  |!A/3||i  =  e.  (23) 


Here 


AF  =  AL  +  XAT. 


Since  L  is  continuously  differentiable  w.r.t.  /3,  we  have 


j 


(24) 


And  since  T{f3)  =  |/3j|,  we  have 

Ar  =  ^sign+(/3‘,A/3,)||A/3,||.  (25) 

j 

Therefore  as  e  ^  0,  (23)  becomes  a  linear  programming  problem.  This  indicates, 
near  a  nonstationary  point,  if  there  is  a  descending  point  on  the  polytope  then  there 
is  a  descending  direction  on  the  vertices.  More  precisely,  if  3A/3  such  that  AF(/3*  + 
A/3;  A)  <  0  and  is  not  o(e),  then  there  exists  a  descending  direction  j.  This  contradicts 
with  Lemma  1(c). 

Therefore,  for  any  j,  either  AF(/3*  + A/3y;  A)  =  o(e)  or  it  is  not  o(e)  but  less  than  zero  in 
both  positive  and  negative  directions  where  the  later  corresponds  to  zero  entries  in  f3t. 
Thus,  by  the  continuous  differentiability  assumption,  Pt  is  in  the  o(l)  neighborhood 
of  P*{Xt)  which  implies  F(/3*;  A*)  —  F(/3*(A‘);  A*)  =  o(l).  This  concludes  the  proof. 

(b)  First,  suppose  we  have  A‘+^  <  A*,  A*  <  A*  and  t  <  t' .  Immediately,  we  have 
A*  >  A*',  then 

F(/3‘';  A*')  <  F(/3‘;  A*')  <  F(/3‘;  A‘)  <  F(/3‘';  A*). 

Therefore 

F(/3‘';  A‘)  -  F(/3*';  A*')  >  F(/3‘;  A*)  -  F(/3*;  A*'), 

from  which  we  get 

r(/3‘')  >  r(/3‘). 
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So  the  BLasso  solution  before  each  time  A  gets  relaxed  strictly  increases  in  Li  norm. 
Then  since  the  Li  norm  can  only  change  on  an  e-grid,  so  A  can  only  be  relaxed  finite 
times  till  BLasso  reaches  the  unregularized  solution. 

Now  for  each  value  of  A,  since  BLasso  is  always  strictly  descending,  the  BLasso  solu¬ 
tions  never  repeat.  By  the  same  e-grid  argument,  BLasso  can  only  take  finite  steps 
before  A  has  to  be  relaxed. 

Combining  the  two  arguments,  we  conclude  that  for  each  e  >  0  it  can  only  take  finite 
steps  to  run  BLasso. 
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